ELLIOTT III, JAMES JOHN. Resilient Iterative Linear Solvers Running Through Errors
Abstract
ELLIOTT III, JAMES JOHN. Resilient Iterative Linear Solvers Running Through Errors. (Under the direction of Frank Mueller.)
Future extreme-scale computer systems may expose incorrect behavior to applications in order to save energy or increase performance. However, resilience research struggles to come up with useful abstract programming models for reasoning about faults in applications. This work mainly focuses on silent soft errors, that is, errors that do not cause the system to halt and provide no indication that they occurred. The approach presented is not specific to silent soft errors; it is a general model for tolerating abnormal behavior in numerical algorithms. We present findings targeted at silent faults that impact the data used by an algorithm. Silent faults in data, which we refer to as silent data corruption (SDC), may lead to the algorithm operating on incorrect data and presenting invalid outputs. The overarching goal of this work is to ensure that we obtain valid solutions in the presence of soft faults.
Existing work in algorithm fault tolerance randomly flips bits in running applications, but this only shows average-case behavior for a low-level, artificial hardware model. Algorithm developers need to understand worst-case behavior with the higher-level data types they actually use in order to make their algorithms more resilient. Also, we know so little about how soft faults may manifest in future hardware that it seems premature to draw conclusions about the average case. We argue instead that numerical algorithms can benefit from a numerical unreliability fault model, where faults manifest as unbounded perturbations to floating-point data. Algorithms can use inexpensive "sanity" checks that bound or exclude error in the results of computations. Given a selective reliability programming model that requires reliability only when and where needed, such checks can make algorithms reliable despite unbounded errors. Sanity checks, and in general a healthy skepticism about the correctness of subroutines, are wise even if hardware is perfectly reliable.
© Copyright 2015 by James John Elliott III
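The idea of an inexpensive sanity check that bounds error can be illustrated with a small sketch. The example below is not taken from the thesis; it is a minimal illustration, assuming a dense matrix stored as a list of rows and a hypothetical helper named `passes_sanity_check`. It relies only on the standard inequality ||Ax||_2 <= ||A||_F ||x||_2, so any computed product whose norm exceeds that bound (or that contains NaN or Inf) must be wrong, however large the perturbation was.

```python
import math


def frobenius_norm(A):
    """Frobenius norm of a dense matrix given as a list of rows."""
    return math.sqrt(sum(a * a for row in A for a in row))


def two_norm(v):
    """Euclidean norm of a vector given as a list of floats."""
    return math.sqrt(sum(x * x for x in v))


def matvec(A, x):
    """Plain dense matrix-vector product."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]


def passes_sanity_check(A, x, y, slack=1e-12):
    """Hypothetical check: is y a plausible value of A times x?

    Rejects non-finite entries and any y with
    ||y||_2 > ||A||_F * ||x||_2 (up to a small rounding slack).
    """
    if not all(math.isfinite(v) for v in y):
        return False
    bound = frobenius_norm(A) * two_norm(x)
    return two_norm(y) <= bound * (1.0 + slack)


A = [[4.0, 1.0], [1.0, 3.0]]
x = [1.0, 2.0]
y = matvec(A, x)

# Simulate silent data corruption as a large perturbation of one entry.
corrupted = list(y)
corrupted[0] = 1e30

ok_clean = passes_sanity_check(A, x, y)
ok_corrupted = passes_sanity_check(A, x, corrupted)
```

A check of this kind does not detect small perturbations that stay within the bound; it only excludes results that are impossible, which turns an unbounded error into a bounded one. In practice the norm of the matrix would be computed once and reused, so the per-product cost is a single vector norm.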
Similar resources
High Performance Linear System Solver with Resilience to Multiple Soft Errors
In the multi-peta-flop era for supercomputers, the number of computing cores is growing exponentially. However, with integrated circuit technology scaling below 65 nm, the critical charge required to flip a gate or a memory cell is dangerously reduced. Combined with higher vulnerability to cosmic radiation, soft errors are expected to become anything but inevitable for modern supercomputer syst...
An error-resilient redundant subspace correction method
As we stride toward the exascale era, due to increasing complexity of supercomputers, hard and soft errors are causing more and more problems in high-performance scientific and engineering computation. In order to improve reliability (increase the mean time to failure) of computing systems, a lot of efforts have been devoted to developing techniques to forecast, prevent, and recover from errors...
Monte Carlo linear solvers with non-diagonal splitting
Monte Carlo (MC) linear solvers can be considered stochastic realizations of deterministic stationary iterative processes. That is, they estimate the result of a stationary iterative technique for solving linear systems. There are typically two sources of errors: (i) those from the underlying deterministic iterative process and (ii) those from the MC process that performs the estimation. Much rog...
Iterative linear system solvers with approximate matrix-vector products
There are classes of linear problems for which a matrix-vector product is a time consuming operation because an expensive approximation method is required to compute it to a given accuracy. One important example is simulations in lattice QCD with Neuberger fermions where a matrix multiply requires the product of the matrix sign function of a large sparse matrix times a vector. The recent intere...
Applying Automated Memory Analysis to Improve Iterative Algorithms; CU-CS-1012-06
Historically, iterative solvers have been designed so as to minimize the number of floating-point operations. We propose instead that iterative solvers should be designed to minimize the amount of data that must be loaded from the memory hierarchy to the CPU. In this paper, we describe automated memory analysis, a technique to improve the memory efficiency of a sparse linear iterative solver. O...
Publication date: 2015